Abstract — Wireless sensor network (WSN) testbeds are useful
because they provide a way to test applications in an environment
that makes it easy to deploy experiments, configure them
statically  or  dynamically,  and  gather  performance  information. 
Sensor  data  collected  in  the  field  can  be  replayed  on  nodes,  and 
new  ways  to  process  the  data  can  be  tested  easily.  Testbeds 
are  rapidly  growing  in  size,  with  hundreds  or  thousands  of 
devices,  and  testbed  services  are  also  becoming  richer  and  more 
complex.  Due  to  their  size  and  complexity,  faults  can  (and  do) 
occur  in  these  testbeds,  affecting  the  outcomes  of  experiments. 
Awareness  of  testbed  health  status  is  important  to  both  testbed 
administrators  charged  with  maintaining  functional  services,  and 
users  who  prefer  to  use  healthy  devices  and  like  to  know  if  there 
are  any  failures  during  their  experiments. 

Based  on  our  experience  with  Kansei,  a  large  WSN  testbed 
at  Ohio  State,  we  identify  use  cases  that  motivate  the  design 
of  Chowkidar,  a  health  monitoring  facility.  Key  among  these 
are:  monitoring  as  a  service  that  operates  independently  of  users 
to  provide  up-to-date  testbed  status  information;  monitoring  of 
heterogeneous  devices  over  a  mixture  of  IP  and  non-IP  networks; 
distinguishing  between  node  and  interface  failures;  and  use 
of  network  dependency  information  to  diagnose  common-mode 
failures  such  as  power  supply  or  Ethernet  hub  failure.  We  then 
present  a  centralized  and  a  distributed  Chowkidar  protocol  that 
reliably monitor the health of large, heterogeneous WSN testbeds
and  experimentally  compare  their  performance.  We  report  on 
initial  experiences  and  lessons  learnt  from  the  integration  of 
Chowkidar  with  Kansei,  including  feedback  from  both  testbed 
users  and  administrators  who  have  found  Chowkidar  to  be  a 
useful  tool  for  improving  the  accuracy  and  efficiency  of  testbed 
experimentation  and  maintenance,  and  the  need  for  well-defined 
policies  to  address  issues  such  as  minimizing  interference  with 
concurrently  running  experiments.  Finally,  we  discuss  extensions 
that  enhance  the  functionality  and  usability  of  Chowkidar. 


I.  Introduction 

Wireless  sensor  networks  (WSNs)  have  gained  in 
popularity,  due  to  their  potential  for  use  in  a  variety 
of  applications  such  as  perimeter  security  and  intrusion 
detection,  structural  monitoring,  industrial  sensing  and 
control, and medical applications; however, developing and
fielding a WSN is hard. Simulations are a useful way to debug
code  and  get  basic  protocols  working,  but  they  do  not  take 
into  account  the  realities  of  radio  communication,  power 
consumption, unanticipated faults, and the like. Hence WSNs
tested  via  simulations  often  do  not  work  when  deployed 
in  the  field.  On  the  other  hand,  experimenting  with  a 
field-deployed  WSN  is  not  practical  due  to  significant  time 
and  labor  overheads.  When  something  does  not  work  in  a 
WSN  application,  understanding  why  can  be  difficult  since 
memory,  power,  bandwidth  and  reliability  limit  a  developer’s 
ability  to  instrument  the  network.  A  testbed,  designed  to 
support  experimentation  with  actual  devices  in  a  realistic 
environment,  provides  an  effective  compromise  between 
simulation  and  deployment  that  can  speed  WSN  development 
by  providing  a  supporting  infrastructure  to  run,  configure  and 
monitor  experiments. 


Fig. 1. The Kansei testbed at Ohio State: (a) physical layout of Kansei; (b) a multi-device Kansei node.


Key  to  testbed  efficacy  is  a  reliable  infrastructure  through 
which deployment, configuration, monitoring and data retrieval
for an experiment can be done with minimal interference
with the experiment itself. Large testbeds have hundreds
or  thousands  of  devices,  using  different  types  of  hardware 
and  software.  One  example  of  such  a  large,  multi-platform 
testbed  is  Kansei  [1],  which  we  have  developed  at  Ohio  State 
and  is  shown  in  Figure  1.  Kansei  currently  houses  several 
hundred  WSN  devices  of  different  types,  whose  characteristics 
are  listed  in  Table  I. 


A.  The  Need  for  Health  Monitoring 

TABLE I. Different platforms in Kansei.

Device type   XSM              TelosB        Stargate
Processor     4MHz             8MHz          400MHz
RAM           4KB              10KB          32MB
OS            TinyOS           TinyOS        Linux
Interfaces    CC1000, Serial   CC2420, USB   Ethernet, 802.11b, Serial, USB
Bandwidth     38.4kbps         250kbps       11Mbps

The use of WSN testbeds is rapidly evolving from simple
testing and debugging of protocols to complex, multi-phase
experimentation in which experiments are scripted so that
when a particular run completes, the data generated is analyzed
and a new set of parameters, likely to give better results,
is automatically selected and the experiment is re-run. Such
scenarios  are  common  in  testing  routing  or  MAC  protocols 
where  many  different  parameters  such  as  queue  length,  backoff 
intervals,  and  the  like  may  have  to  be  tuned  to  select  the 
combination that works best. At present, Kansei already supports
interactive experimentation in which a user can visualize,
in  real  time,  the  data  produced  by  an  experiment  and  select 
new  parameters  that  can  then  be  injected  by  Kansei  into 
the  experiment.  Automated  execution  of  scripted  experiments 
will  soon  become  part  of  Kansei’s  scheduling  and  experiment 
management  service. 

Given  the  relatively  unreliable  nature  of  WSN  devices,  faults 
may  occur  before  or  during  an  experiment  that  affect  the 
quality  of  data  produced  in  a  run.  Even  if  devices  are  reliable, 
software  bugs  or  incorrect  device  configurations  could  lead  to 
faults.  While  designing,  maintaining  and  experimenting  using 
Kansei  over  the  last  two  years,  we  have  encountered  a  variety 
of  faults,  which  we  believe  are  also  applicable  to  other  WSN 
testbeds.  These  faults  can  be  categorized  as  follows. 

•  Device  fail-stop  faults.  A  fail-stop  results  in  the  complete 
failure  of  a  device.  Fail-stops  may  occur  as  a  result  of 
hardware  failure  or  due  to  software  crashes  that  render 
the device completely unresponsive. The impact of a fail-stop
fault may depend on the type of the affected device
in heterogeneous testbeds such as Kansei. For example,
the  fail-stop  of  an  XSM  mote  simply  results  in  that  mote 
being  unavailable  for  user  experimentation.  However,  the 
fail-stop  of  a  Stargate  has  much  more  impact  since  a 
Stargate  is  used  by  Kansei  to  program  and  log  data  from 
XSM  and  TelosB  motes  that  are  attached  to  it  via  their 
serial  and  USB  interfaces.  Similarly,  the  fail-stop  of  an 
Ethernet  hub  results  in  loss  of  wired  connectivity  to  all 
of  its  attached  Stargates  and  in  turn  their  attached  motes. 
Since the wired network is used by Kansei for instrumentation
and data retrieval during an experiment, these
fail-stops  render  the  affected  testbed  regions  unavailable. 

•  Network  interface  faults.  A  network  interface  fault  at  a 
device  results  in  the  device  being  unable  to  communicate 
over  that  interface.  Network  interface  faults  may  occur 
due  to  driver  failure  or  loose  hardware  connections  such 
as  an  unplugged  Ethernet  cable,  an  unseated  802.11b 
wireless card or a detached Stargate-mote connector.
Since most testbeds have separate experimentation and
instrumentation  interfaces,  the  failure  of  a  single  network 
interface  does  not  render  a  device  unreachable.  Failure 
of  all  its  network  interfaces,  however,  results  in  a  device 
being  partitioned  from  the  testbed. 

•  Software  faults.  A  software  fault  results  in  a  physically 
correct  device  or  network  interface  being  driven  into  a 
corrupted  or  bad  state.  These  faults  may  occur  due  to 
bad  or  buggy  code  or  due  to  misconfigured  software.  For 
example,  a  user  program  may  change  the  radio  power 
level to a very low value or change the transmission frequency,
so that neighbors cannot receive radio messages
sent  by  this  device.  Another  type  of  configuration  fault 
occurs  if  the  Stargate  or  its  attached  mote  changes  the 
configuration  of  some  pin  in  their  serial  connector  to  an 
incompatible  state. 

If these faults are not detected and accounted for while analyzing
experimental results, users or automated test programs
that  consume  them  could  end  up  deriving  incorrect  conclusions 
or  making  incorrect  parameter  choices  that  adversely  affect  the 
performance  of  subsequent  runs.  It  is  therefore  imperative  for 
health  status  information  to  be  available  before,  during  and 
after  an  experiment  so  that  the  results  produced  by  a  testbed 
experiment  can  be  accurately  analyzed. 

In addition to node health status, diagnostic services provided
by monitoring can help administrators distinguish between
a testbed fault and an error caused by the experiment
itself. This is important since some faults may have the same
visible effect as others. For example, an 802.11b wireless
driver process crash has the same effect as the card becoming
unseated. Diagnosis of such faults can help administrators
determine the appropriate corrective actions needed to restore
the  testbed  to  a  fully  functional  state.  Knowing,  for  example, 
via  alternate  network  interfaces  that  the  wireless  driver  has 
crashed,  an  administrator  could  simply  reload  it,  whereas  an 
unplugged  card  would  require  the  administrator  to  go  to  the 
testbed  and  physically  re-insert  it. 

We  therefore  developed  Chowkidar,  Hindi  for  “watchman”, 
as  a  service  that  provides  health  data  about  testbed  resources, 
periodically as well as on demand. Users (or a testbed scheduling
service) can use the information provided by Chowkidar
to  ensure  that  experiments  run  only  on  working  devices  by 
checking  health  before  and  after  an  experiment.  They  can 
also  better  assess  experimental  results  knowing  whether  or  not 
some  failures  occurred  during  the  experiment.  Administrators 
can  use  Chowkidar  to  learn  about  testbed  faults  and  any 
available  diagnostic  information  to  help  determine  the  most 
effective  response. 

B.  Monitoring  Requirements 

The general network monitoring problem may be stated
as follows: identify which network objects are functioning
correctly and which are not and, so far as possible, indicate
why the latter are not functioning. Objects
typically  include  network  resources  such  as  links,  nodes,  and 
so  forth,  but  may  also  include  things  such  as  application 
components  and  processes. 

In our practical experience with the Kansei testbed, various
recurring  use  cases  have  caused  us  to  add  a  number  of  specific 
requirements  to  monitoring  WSN  testbeds. 

•  Reliability.  For  monitoring  results  to  be  useful,  they 
should be reliable. Reliability has two important dimensions:
first, the reported results should be consistent
with  the  actual  state  of  the  testbed  and  second,  they 
should  be  complete  so  that  if  a  node  is  working  and  is 
reachable,  this  status  should  be  known  to  the  monitor. 
Due  to  resource  constraints  such  as  limited  bandwidth 
and  energy,  WSNs  often  use  sampling-based  monitoring 
where  the  overall  state  of  the  network  is  estimated  based 
on  data  received  from  a  subset  of  nodes.  The  goal 
of  a  testbed  health  monitoring  service  however,  is  to 
provide  ground  truth  information  against  which  users 
can  compare  obtained  experimental  results.  We  therefore 
require  that  the  status  of  each  resource  in  the  network  be 
monitored  and  that  these  results  be  reliably  exfiltrated. 

Reliability also implies that monitoring must be independent
of an experiment’s semantics or communication
structure,  otherwise  design  or  implementation  errors  in 
an  experiment  could  yield  incorrect  monitoring  results. 

•  Efficiency.  Regardless  of  whether  a  monitor  collects 
health  status  from  some  or  all  nodes,  monitoring  should 
be  both  time  and  energy  efficient.  By  this  we  mean  that 
monitoring  must  complete  quickly,  using  few  messages. 

•  Handling  heterogeneity.  As  exemplified  in  Table  I, 
WSN  testbeds  may  have  different  hardware  and  software 
platforms  with  varying  capabilities  and  limitations.  A 
monitor  must  therefore  handle  heterogeneous  devices 
and  networks.  This  might  include  PCs,  embedded  Linux 
systems  such  as  Stargates  and  motes  such  as  Mica2s, 
XSMs  or  TelosBs,  as  well  as  a  variety  of  networks, 
including  mote  radio,  WiFi,  Ethernet,  and  others.  There 
can  be  several  instances  of  each  network  (such  as  multiple 
Ethernets).  A  monitor  must  be  able  to  check  the  status  of 
each  of  these. 

•  Diagnosing  failure  types.  Administrators  need  to  be 
able  to  distinguish  between  complete  device  failure  and 
interface  failure.  In  the  former  case,  devices  usually  have 
to  be  physically  repaired  or  replaced;  in  the  second,  a 
remote  correction  might  be  possible.  Hence  the  monitor 
must  distinguish  between  these  failures. 

•  Adaptability.  WSN  testbeds  keep  evolving  due  to  factors 
such  as  device  failures  and  replacements,  new  technology 
and  changing  user  requirements.  Testbed  configurations 
may  also  change  as  a  result  of  rewiring  and  physical 
relocation.  A  monitor  must  therefore  not  be  dependent 
on any particular testbed architecture and must be able
to easily accommodate different types of networks and
devices and adapt to changes in network configuration
with minimal effort.

•  Usability.  Monitoring  results  must  be  available  centrally 
to  users  and  administrators.  Administrators  need  status 
information  periodically  to  detect  permanent  failures  or 
identify  failure  patterns  while  users  want  to  know  the 
network  status  before  and  after  an  experiment.  Hence 
monitoring must be performed both automatically and on
demand,  targeting  specific  devices  as  appropriate. 

•  Co-existing  with  experiments.  Users  may  be  interested 
in  monitoring  node  health  during  an  experiment.  Hence 
monitoring should be easy to compose with user applications.
At the same time, a monitor must have at most
bounded interference with experiments during concurrent
operation. This is of particular concern for radio networks,
but can affect others such as Ethernet as well.

Existing  monitoring  approaches  such  as  Motelab  [2], 
SNMP  [3],  SNMS  [4],  fault  tracing  [5]  and  Sympathy  [6] 
satisfy some of these requirements, but none satisfies all
of them. In particular, none distinguishes between node and
interface faults, and none supports heterogeneous devices and networks. This is discussed
in  more  detail  in  Section  II.  It  was  because  of  these  limitations 
that  we  developed  Chowkidar. 

Although  these  monitoring  requirements  are  desirable,  it 
might  be  impossible  to  satisfy  all  of  them  on  any  general 
testbed. We therefore identify the following set of assumptions
about testbeds in general that allow Chowkidar to realize the
monitoring  requirements  listed  above. 

•  There is an organizational structure that permits automatic,
unattended monitoring. If we want monitoring to
be  conducted  when  nodes  become  free,  for  instance,  there 
must  be  a  way  for  the  monitor  to  identify  those  nodes. 
In  Kansei,  the  scheduling  service  maintains  a  database  of 
node  status  that  Chowkidar  accesses. 

•  To avoid interference with experiments running concurrently
on the testbed, radio-based communication (including
mote radio and 802.11 WiFi) has a channel dedicated
to  monitoring;  or  else  policies  are  structured  in  such 
a  way  as  to  disable  certain  kinds  of  monitoring  when 
necessary  to  prevent  interference. 

C.  Organization  of  the  Paper 

The rest of the paper is organized as follows. Section II
discusses related work. Section III contains a detailed description
of Chowkidar including its fault model and centralized
and distributed versions of its monitoring protocol.
Section  IV  describes  implementation  details  of  Chowkidar 
including  performance  results  measured  experimentally  on 
the  Kansei  testbed,  integration  with  Kansei  and  real  user 
experiences.  Finally  we  discuss  future  work  and  conclude  in 
Section  V. 
II. Related Work

A  variety  of  monitoring  facilities  have  been  developed  for 
testbeds  and  for  deployed  WSNs,  but  all  those  we  are  aware 
of  fall  short  of  our  needs.  Experiments  are  assumed  to  run 
on  homogeneous  devices;  although  a  testbed  itself  is  usually 
heterogeneous,  existing  support  tools  do  not  take  this  into 
account.  Tools  do  not  distinguish  between  the  health  of  a  node 
and  the  health  of  its  interfaces.  For  some,  the  reliability  is 
too  low  to  be  useful  and  for  others,  there  is  a  dependency 
on  the  communication  structure  of  the  application.  Also,  in 
most  testbeds  monitoring  and  the  subsequent  failure  diagnosis 
requires  explicit  action  on  the  part  of  a  user  or  administrator 
and  is  not  done  automatically. 

Traditional  networks  such  as  the  Internet  use  standard 
protocols  such  as  the  Simple  Network  Management  Protocol 
(SNMP)  [3]  for  monitoring  network  devices  and  identifying 
faults.  However,  SNMP  assumes  the  IP  routing  layer  in  its 
operation  and  is  therefore  dependent  on  the  fault-tolerance  of 
IP  to  be  able  to  reach  the  monitored  devices.  In  WSN  testbeds, 
there  are  often  multiple  paths  to  a  node  using  alternative 
networks  (such  as  mote  radio),  but  SNMP’s  dependence  on 
IP  precludes  its  use. 

Similarly,  Motelab  [2],  Tutornet  [7]  and  Orbit  [8]  provide 
users  with  a  ping-based  status  for  each  device,  indicating 
whether  it  is  reachable  or  not.  However,  simply  detecting  that 
a  device  is  unresponsive  on  a  given  network  is  not  sufficient 
since  it  does  not  support  heterogeneous  networks  and  does  not 
distinguish  network  and  device  interface  faults. 

The  Sensor  Network  Management  System  (SNMS)  [4] 
provides  networking  support  for  WSNs  via  its  own  networking 
stack,  including  routing.  SNMS  allows  network  administrators 
to  remotely  query  network  devices  and  learn  their  status. 
However,  SNMS  does  not  deal  with  heterogeneous  networks, 
and  experimental  studies  such  as  [9]  have  shown  that  reliability 
of  SNMS  does  not  suffice  to  provide  accurate  fault  status. 

Sympathy  [6]  is  designed  for  fault  detection  at  a  central  base 
station in a data collection application in which nodes periodically
send data to the base. Sympathy thus exploits knowledge
of a specific application’s traffic pattern to define certain fault
metrics. Sympathy monitors the flow of application traffic,
evaluates  the  defined  metrics,  and  communicates  them  to  the 
base  station  using  additional  messages.  This  information  is 
collected  by  an  automated  failure  detector  program  at  the  base 
station,  which  tries  to  localize  the  type  and  the  source  of  the 
faults  in  the  network  and  notifies  the  user.  A  similar  approach 
is  used  in  [5]  where  the  fault  management  system  exploits 
not  only  the  continuous  data  traffic  flow  in  the  network  to 
piggy-back  health  information,  but  also  uses  the  route  update 
messages  in  the  routing  protocol  to  effect  changes  in  routing 
paths  for  suspected  nodes  in  order  to  trace  failed  nodes.  These 
approaches,  although  similar  to  ours,  depend  on  knowledge  of 
application routing and traffic patterns. Neither supports
heterogeneous devices, nor is monitoring conducted when an application is
not running.

III.  Monitoring  a  WSN  Testbed  with  Chowkidar 

In this section, we first describe the fault model for Chowkidar
and our approach for diagnosing failure dependencies.
We then present two Chowkidar protocols: the first uses
centralized control while the second one is distributed. Recall
from Section I-B that reliability and efficiency are the key requirements
for WSN testbed monitoring, hence we especially
focus  on  these  aspects  in  our  discussion. 

Although  the  two  protocols  differ  in  a  number  of  ways,  they 
both  have  the  following  characteristics. 

•  Monitoring information is collected and evaluated centrally
at a base station. The base station is aware of the
topology  of  the  network,  and  knows  which  nodes  are  in 
use  by  experiments  and  which  are  not.  This  information 
is  used  to  perform  monitoring,  to  assess  the  results,  and 
to  provide  current  and  historical  status. 

•  Monitors are correct in the presence of node and link fail-stops
and restarts that occur during the run or between
runs.  A  given  collection  either  gives  a  consistent  result 
(corresponding  to  testbed  state  that  existed  during  the  run) 
or  reports  failure.  In  the  latter  case,  the  monitor  can  be 
run  again. 

•  All devices are explored for reachability along all available
networks, with a typical path consisting of multiple
networks. 

•  Paths  to  resources  are  least-cost,  where  the  cost  of  a 
network  link  is  assigned  by  the  administrator,  taking 
into  account  factors  such  as  bandwidth,  reliability,  and 
interference. 

•  Monitoring  runs  are  done  periodically  and,  at  the  request 
of  the  user  or  the  testbed  scheduler,  can  be  done  on 
demand. 

•  Between  monitoring  runs,  no  resources  are  used  except 
for  components  that  listen  for  monitoring  messages. 

As  a  minimum,  we  want  Chowkidar  to  monitor  testbed 
resources  that  are  not  in  use.  We  assume  that  the  testbed  has 
a  scheduling  service  with  the  following  characteristics. 

•  When  an  experiment  begins  on  some  set  of  resources, 
Chowkidar  is  informed  that  those  resources  are  in  use. 

•  When an experiment ends, Chowkidar components are
automatically loaded on those resources and Chowkidar
is informed that the resources are available.

In  our  Kansei  implementation,  Director  is  the  scheduling 
service;  it  maintains  an  SQL  database  of  resource  status. 
Chowkidar  accesses  this  database  to  assess  resource  status  for 
monitoring. 

Since  Chowkidar  is  configuration-driven,  it  is  testbed- 
independent.  For  a  given  testbed,  communication  components 
for each network on a node must be provided along with a
configuration  of  the  testbed  as  a  whole.  Other  than  this,  no 
changes  need  be  made  to  the  architecture  of  Chowkidar  to  use 
it  on  another  testbed. 

A.  Fault  Model 

We  assume  fail-stop  faults  with  clean  restart.  When  a  node 
fails,  it  does  not  communicate  on  any  of  its  interfaces.  When 
an  interface  fails,  the  node  cannot  communicate  on  it.  When 
a  node  restarts,  it  does  so  cleanly,  reinitializing  variables 
as  required  by  the  monitoring  component.  We  assume  an 
interface  does  not  have  state,  so  a  restart  is  always  clean. 

A fault can happen at any time, including during a monitoring
run. If this happens, we want the monitoring run either to
report the failure so that it can be restarted, or to produce a
collection that is consistent in the sense that the state reported
actually occurred in the system during the run.

For  example,  suppose  a  reachable  node’s  status  has  been 
confirmed  and,  during  the  balance  of  the  monitoring  run,  it 
is  not  accessed  again.  If  this  node  fails  before  the  end  of  the 
monitoring  run,  the  report  (that  the  node  is  reachable)  will 
be  inconsistent  with  the  final  state  of  the  system  (in  which  it 
is  not);  but  it  will  be  consistent  with  the  state  of  the  system 
before  the  node  failed. 

On  the  other  hand,  if  a  node  was  found  to  be  reachable 
at  one  point  during  the  run  but  found  to  be  unreachable  later, 
then  the  run  would  terminate  with  an  error  value.  If  we  assume 
that  faults  stop,  or  at  least  stop  long  enough,  then  an  erroneous 
run  can  be  rescheduled,  and  eventually  it  will  succeed. 
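
The consistency rule above can be illustrated with a minimal sketch. It is not Chowkidar's implementation: the names `InconsistentRun`, `collect` and `collect_with_retry` are hypothetical, a run is modeled simply as a sequence of (resource, reachable) observations, and the treatment of a resource first seen down and later seen up (recorded as up, on the assumption that it restarted) is our own assumption.

```python
class InconsistentRun(Exception):
    """Raised when a resource seen up earlier in a run is later seen down."""


def collect(observations):
    """Fold (resource, reachable) observations into one status report.

    A resource observed as reachable and later as unreachable within the
    same run makes the report inconsistent, so the run is abandoned.
    Assumption: a resource seen down and then up is taken to have
    restarted and is simply recorded as up.
    """
    status = {}
    for resource, reachable in observations:
        if status.get(resource) is True and not reachable:
            raise InconsistentRun(resource)
        status[resource] = reachable
    return status


def collect_with_retry(run_once, max_attempts=3):
    """Reschedule an erroneous run; gives up after max_attempts (assumed bound)."""
    for _ in range(max_attempts):
        try:
            return collect(run_once())
        except InconsistentRun:
            continue
    return None


# A node confirmed once and never revisited is reported as reachable.
report = collect([("node1", True), ("node2", False)])
```

If faults stop for long enough, some rescheduled run sees no up-then-down transition and its report is returned.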

Links  in  a  WSN  can  be  unidirectional  or  bidirectional. 
While  we  do  not  require  all  links  to  be  bidirectional,  we  do 
assume  that  a  bidirectional  path  exists  from  the  root,  or  the 
Chowkidar  server,  to  each  node.  Thus,  our  protocols  may  try 
to  use  some  unidirectional  links  and  fail,  but  eventually  they 
will  discover  a  bidirectional  path  if  it  exists. 

Wireless  links  are  subject  to  interference  in  the  presence  of 
multiple  concurrent  senders.  Interference  affects  link  reliability 
and  may  result  in  message  losses  due  to  collisions.  Message 
losses  during  exfiltration  of  health  data  from  the  testbed  may 
affect  monitoring  reliability. 

B.  Diagnosing  Dependencies 

Basic  monitoring  gives  reachability  information  about  nodes 
and  interfaces.  If  a  node  is  reachable  then  it  is  up;  but  if  not,  we 
cannot automatically conclude its status, since it might still
be functional, in the sense that it would be reachable if placed
in  a  suitable  environment.  Although  unreachable,  a  node  might 
be  functional  if  the  power  is  off,  if  all  of  its  interfaces  are  not 
functional,  or  if  it  can  only  be  reached  through  other  nodes 
that  are  unreachable. 

We  have  to  consider  the  case  of  devices  where  checking 
status depends on some other device. These include “dumb”
nodes such as Ethernet hubs that cannot be addressed directly
as  well  as  device  interfaces  and  power  supplies. 

To  know  whether  a  dumb  node  such  as  an  Ethernet  hub 
is  working  or  not  we  have  to  try  to  reach  an  attached  node. 
Hence  if  some  attached  node  is  reachable  via  the  dumb  node 
then  the  dumb  node  is  reachable  and  therefore  up.  However, 
if  no  attached  node  is  reachable  via  the  dumb  node  then  either 
the  dumb  node  is  not  functioning,  all  of  the  attached  devices 
are not functioning, all of the interfaces of the attached devices
are  not  functioning,  or  some  combination.  Of  course,  there  can 
be  additional  dependencies,  since  the  interface  of  an  attached 
node  won’t  function  if  the  node  itself  is  not  functioning,  which 
in  turn  might  be  due  to  a  power  supply  failure. 

For  power  supplies,  the  supply  is  up  if  some  associated 
node  is  up.  If  all  associated  nodes  are  not  reachable  then 
either  the  power  supply  is  not  functioning  or  all  the  nodes 
are  unreachable  for  some  other  reason. 

For  interfaces,  an  interface  on  a  node  is  reachable  if  it  can 
be  used  to  reach  some  other  node;  in  this  case,  the  interfaces 
are  up  as  are  both  nodes  and  the  link  between  the  nodes. 
Although  unreachable,  an  interface  might  be  functional  if  its 
node  is  unreachable  or  if  no  other  nodes  are  reachable  via  the 
interface. 

Identifying  these  alternatives  requires  knowledge  of  the 
testbed  topology.  Intuitively,  some  alternatives  are  more  likely 
than  others.  If  a  hub  and  all  of  its  attached  nodes  are 
unreachable  via  the  hub,  it  is  more  likely  that  it  is  the  hub 
that  is  not  functional.  These  intuitions  can  be  used  to  order 
the  alternatives  to  guide  a  testbed  administrator.  As  part  of 
future  work  we  plan  to  gather  data  that  relate  alternatives  to 
the  actual  diagnosis  so  that  probabilities  can  be  assigned  based 
on  historical  data. 
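
As an illustration of the inference rules in this subsection, the following sketch derives the status of a shared resource (a hub or a power supply) from the reachability of its dependents. The function name `diagnose_shared`, the string labels and the data layout are hypothetical. The ordering of candidates for a hub, with hub failure first, follows the intuition stated above and is not based on measured probabilities.

```python
def diagnose_shared(name, kind, dependents, reachable):
    """Infer the status of a resource that cannot be probed directly.

    name:       identifier of the hub or power supply
    kind:       "hub" or "power"
    dependents: nodes attached to the hub or fed by the supply
    reachable:  nodes confirmed reachable (for a hub: reachable via that hub)

    Returns (status, candidates), where candidates lists the possible
    explanations, most likely first, when the status cannot be concluded.
    """
    if any(node in reachable for node in dependents):
        return "up", []
    if kind == "hub":
        candidates = [
            f"{name} is not functioning",
            "all attached devices are not functioning",
            "all interfaces of the attached devices are not functioning",
            "some combination of the above",
        ]
    elif kind == "power":
        candidates = [
            f"{name} is not functioning",
            "all associated nodes are unreachable for some other reason",
        ]
    else:
        raise ValueError(f"unknown kind: {kind}")
    return "unknown", candidates


reached = {"stargate1"}
hub_a = diagnose_shared("hubA", "hub", ["stargate1", "stargate2"], reached)
hub_b = diagnose_shared("hubB", "hub", ["stargate3", "stargate4"], reached)
```

Here `hubA` is concluded to be up because one attached Stargate answered through it, while `hubB` is left as unknown with the hub itself listed as the first candidate.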

C.  Centralized  Chowkidar 

The  centralized  version  of  Chowkidar  performs  a  collection 
on  the  network  that  indicates  which  nodes  are  reachable.  It 
also  gives  information  about  the  status  of  interfaces.  The  main 
idea  is  that  the  base  station,  using  configuration  information 
and  knowledge  of  nodes  that  are  in  use,  attempts  to  construct 
a  path  to  each  node  whose  status  is  unknown,  avoiding  links 
that  are  down.  The  process  terminates  either  when  all  nodes 
are  confirmed  as  up  or  when  there  are  no  more  paths  to  check. 

Since  a  given  collection  does  not  depend  on  previous  ones, 
the protocol handles both fail-stop and restart directly. If a
failure that affects the consistency of the collection happens
while the protocol is running, it will be detected and the
collection aborted. Restarts can happen at any time and will
be  included  the  next  time  a  path  is  created  that  includes  the 
resource. 

The  configuration  information  can  be  provided  in  two  ways. 
In  the  “high  atomicity”  view,  the  configuration  consists  of 
nodes  with  network  links  between  them.  In  this  case,  a 
collection  will  give  all  reachable  nodes  but  may  not  check  all 
interfaces. Alternatively, the configuration can be given with
“low atomicity” in which interfaces are explicitly included.
This  forces  Chowkidar  to  explore  paths  that  include  each 
interface,  thus  checking  each  one. 
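
One way to picture the two views is sketched below. The representation is an assumption made for illustration only, since the configuration format is not specified here: a high-atomicity configuration is taken to be a list of (node, node, network) links, and the low-atomicity view is obtained by making each interface an explicit vertex named `node/network`. The function names are hypothetical.

```python
def to_low_atomicity(links):
    """Expand node-to-node links so that interfaces become explicit vertices.

    links: list of (node_a, node_b, network) tuples (high-atomicity view).
    Returns a list of edges in which each link a--b over `network` becomes
    a -- a/network -- b/network -- b, without duplicate edges.
    """
    edges, seen = [], set()
    for a, b, network in links:
        iface_a, iface_b = f"{a}/{network}", f"{b}/{network}"
        for edge in ((a, iface_a), (iface_a, iface_b), (iface_b, b)):
            if edge not in seen:
                seen.add(edge)
                edges.append(edge)
    return edges


def interfaces(edges):
    """List the interface vertices that a collection would have to cover."""
    found = []
    for edge in edges:
        for vertex in edge:
            if "/" in vertex and vertex not in found:
                found.append(vertex)
    return found


high = [
    ("server", "stargate1", "ethernet"),
    ("server", "stargate2", "ethernet"),
    ("stargate1", "xsm1", "serial"),
]
low = to_low_atomicity(high)
ifaces = interfaces(low)
```

In the expanded graph every interface is a vertex of unknown status, so a collection that must resolve all vertices necessarily builds a path through each interface.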

The  process  is  as  follows,  where  initially,  the  status  of  each 
node  and  link  is  unknown. 

1)  Using  testbed  configuration  information,  construct  a 
least  cost  path  (LCP)  tree  that  covers  the  free  nodes 
of  unknown  status,  using  links  that  are  not  down,  with 
the  constraint  that  all  leaves  are  nodes  whose  status  is 
unknown.  The  tree  can  be  empty,  which  means  that  all 
reachable  nodes  have  been  found,  and  we  are  done. 

2)  For  each  path,  construct  a  probe  message  that  contains 
the  path.  Send  the  message  to  the  first  node  along  the 
designated  interface.  As  each  node  receives  the  message, 
it  forwards  it  to  the  next  node  if  any,  and  waits  for  a 
reply.  If  a  node  is  the  leaf  in  the  path,  it  sends  a  reply 
back  along  the  path.  If  a  node  that  is  waiting  for  a  reply 
receives  it,  it  forwards  it  on;  if  it  does  not  arrive  after 
a  timeout  (which  can  be  calculated  from  the  path),  it 
replies  on  its  own. 

3)  When  a  reply  arrives  at  the  base  station,  the  knowledge 
of  the  path  and  the  node  that  replied  lets  the  base  station 
identify  the  nodes  and  links  that  were  reachable  along 
that  path;  these  are  marked  as  “up”.  If  some  node  other 
than  the  leaf  replied  then  the  downward  link  is  marked 
as  down  and  is  effectively  removed  from  the  topology 
for  the  duration  of  the  run. 

A  network  link  involves  a  sender  and  a  receiver; 
hence  lack  of  responsiveness  on  the  link  implies  that  the 
sender’s  interface  is  down  or  that  the  receiver’s  interface 
is down or that the downward node is down (or
any  combination).  In  any  case,  we  consider  the  link  as 
failed  and  do  not  use  it  again:  it  is  removed  from  the 
configuration  for  the  duration  of  the  run. 

4)  Resume  from  #1. 

If,  during  the  run,  a  particular  node  or  link  was  found  to  be 
up  as  the  result  of  some  probe  but  later  when  that  node  or  link 
was  reused,  it  was  found  to  be  down,  then  a  node  or  link  fault 
has  occurred  that  affects  consistency.  In  that  case,  we  abort 
the  run  and  restart.  If  the  run  did  not  abort  with  an  error,  then 
the  collection  is  consistent.  Hence,  in  the  presence  of  restart 
or  of  fail-stop  that  does  not  affect  consistency,  centralized 
Chowkidar  will  terminate  with  a  consistent  collection.  If  a  fail- 
stop  happens  that  might  affect  consistency,  it  will  terminate 
with  an  error. 
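
The steps above can be sketched as a small Python simulation. This is a minimal sketch under stated assumptions, not the Kansei implementation: the names `least_cost_paths`, `send_probe`, `collect`, and `InconsistentRun` are hypothetical, the network is replaced by a static set `alive` of responsive nodes, and one probe is sent per node of unknown status rather than only per leaf of the LCP tree.

```python
import heapq


class InconsistentRun(Exception):
    """A link confirmed up earlier in this run was later found down."""


def least_cost_paths(root, links, down_links):
    """Step 1: Dijkstra from the root, skipping links already marked down."""
    dist, paths, heap = {root: 0}, {root: [root]}, [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, cost in links.get(u, {}).items():
            if (u, v) in down_links:
                continue
            if v not in dist or d + cost < dist[v]:
                dist[v] = d + cost
                paths[v] = paths[u] + [v]
                heapq.heappush(heap, (d + cost, v))
    return paths


def send_probe(path, alive):
    """Step 2 stand-in: index of the last node on the path that replied."""
    last = 0
    for i in range(1, len(path)):
        if path[i] not in alive:
            break
        last = i
    return last


def collect(root, free_nodes, links, alive):
    """Run one collection; return (up_nodes, down_links)."""
    up_nodes, up_links, down_links = {root}, set(), set()
    while True:
        paths = least_cost_paths(root, links, down_links)
        targets = [n for n in free_nodes if n in paths and n not in up_nodes]
        if not targets:  # empty tree: all reachable nodes have been found
            return up_nodes, down_links
        for target in targets:
            path = paths[target]
            replied = send_probe(path, alive)
            # Step 3: everything up to the replying node is marked up.
            for i in range(1, replied + 1):
                up_nodes.add(path[i])
                up_links.add((path[i - 1], path[i]))
            if replied < len(path) - 1:
                bad = (path[replied], path[replied + 1])
                if bad in up_links:
                    raise InconsistentRun(bad)  # abort; caller restarts
                down_links.add(bad)  # removed for the rest of the run
```

Each pass of the `while` loop either confirms every target or marks at least one new link as down, so the loop terminates. Because `alive` is static in this sketch, `InconsistentRun` is never raised here; it only marks where the abort described above would occur.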

As a protocol that checks reachability by exploring all
paths sequentially, centralized Chowkidar does not scale well.
Calculating an LCP tree takes $O(|N| \cdot |E|)$ time, where $N$ is
the number of nodes and $E$ is the number of edges per node.
Since a network can be fully connected, the time complexity
is $O(N^2)$. Each tree results in $O(N)$ probes and, since the set of
probes covers the nodes, there are $O(N)$ messages total. The
number of times the LCP tree calculation and probing
have to be repeated depends on the pattern of failures. It can
be as high as $O(N^2)$ in a network where half the nodes are
down and each is directly linked to an up node; in this case,
paths will be created from each up node to each down one.
Hence time complexity can be as high as $O(N^4)$ and message
complexity as high as $O(N^3)$.
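
The worst-case bounds combine as follows. This is only a restatement of the argument in the text, assuming a fully connected network and writing "rounds" (our label) for the number of times tree calculation and probing are repeated:

```latex
\begin{align*}
\text{rounds}   &\le \tfrac{N}{2} \cdot \tfrac{N}{2} = \tfrac{N^2}{4} = O(N^2) \\
\text{time}     &= O(N^2)\ \text{rounds} \times O(N^2)\ \text{per LCP tree} = O(N^4) \\
\text{messages} &= O(N^2)\ \text{rounds} \times O(N)\ \text{per round} = O(N^3)
\end{align*}
```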

D.  Distributed  Chowkidar 

As  described  earlier,  in  the  worst  case  pattern  of  failures,  the 
time  and  message  complexity  of  the  centralized  protocol  can 
be quite high. From our experience with Kansei, a collection
for centralized Chowkidar can take up to 10 minutes in the
low atomicity case for 210 nodes. Kansei is in the process of
growing, with up to 630 TelosB motes to be added in the near
future, so the centralized approach clearly will not scale.

We  have  therefore  developed  a  self-stabilizing  distributed 
protocol  [10]  to  be  used  as  part  of  Chowkidar.  This  protocol 
solves an instance of the well-known problem of message-passing
rooted spanning tree construction and its use in PIF
(propagation of information with feedback) for the case of a
WSN. Our protocol differs from previous work in message-passing
PIFs in two ways that are critical for the WSN model.
First,  it  is  message  efficient  in  that  it  uses  only  a  few  messages 
per  node,  which  is  important  given  the  resource  constraints  of 
WSNs.  Second,  it  tolerates  ongoing  node  as  well  as  link  faults, 
and  their  restart,  which  do  indeed  occur  in  WSNs,  in  contrast 
to  requiring  that  faults  stop  during  convergence. 

Our  distributed  protocol  first  builds  a  spanning  tree  over 
the  set  of  reachable  nodes  in  the  network.  A  key  idea  in 
this  tree  construction  is  a  handshake  between  a  node  and  its 
potential  parent.  At  the  start  of  an  execution,  the  root  (or  the 
central  Chowkidar  server)  broadcasts  a  wave  message  on  all 
its  outgoing  interfaces  with  a  session  number  higher  than  any 
used  previously.  When  a  node  X  receives  a  wave  broadcast 
from  another  node  Y  with  a  higher  session  number,  it  asks 
Y  to  become  its  parent.  Y  records  X  as  a  child  and  sends  an 
acknowledgement. Our protocol also forms LCPs from each
node to the root by phasing the delivery of the wave messages:
nodes forward a wave message on all outgoing links, but
messages on lower-cost links are forwarded earlier than those
on higher-cost links. Because the total forwarding delay is
proportional to the total path cost, a node that is connected to
the root through multiple paths receives the wave message on
its LCP first and selects that path.
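
The effect of phased wave delivery can be illustrated with a small event-driven simulation. This is a sketch under assumptions, not the protocol of [10]: the function name `build_tree` and the parameter `delay_per_cost` are hypothetical, the parent handshake is assumed to succeed every time, and no faults occur.

```python
import heapq


def build_tree(root, links, delay_per_cost=1.0):
    """Simulate one wave session; return (parent, arrival_time) maps.

    links[u][v] is the cost of the link u -> v. A wave forwarded on a
    link is delayed in proportion to that link's cost, so the first
    wave a node hears has travelled along its least cost path.
    """
    parent = {root: None}
    arrival = {root: 0.0}
    events = []  # (arrival time, sender, receiver)
    for v, cost in links[root].items():
        heapq.heappush(events, (cost * delay_per_cost, root, v))
    while events:
        t, sender, node = heapq.heappop(events)
        if node in parent:
            continue  # already joined the tree in this session
        parent[node] = sender  # handshake assumed to succeed
        arrival[node] = t
        for v, cost in links[node].items():
            if v not in parent:
                heapq.heappush(events, (t + cost * delay_per_cost, node, v))
    return parent, arrival
```

For example, with links root-a (cost 1), a-b (cost 1), and root-b (cost 5), node b hears the wave from a at time 2 before the direct wave from the root at time 5, so it selects a as its parent.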

If  node  or  interface  failures  do  not  happen  during  the  tree 
formation,  the  result  is  a  tree  with  bidirectional  edges:  each 
child  knows  its  parent  and  each  parent  knows  its  children. 
When  a  PIF  is  subsequently  run  on  the  tree,  each  parent  waits 
on  its  children  to  report  before  it  reports  to  its  parent;  if 
the  parent  fails  to  hear  from  a  child  in  a  timely  fashion,  it 
initiates  a  failure  message  to  the  root,  in  which  case  a  new 
tree is constructed. The acknowledgement process between
the proposed parent and child combined with child timeouts
lets us handle failures that occur during the acknowledgement
sequence.  If  a  node  fails  to  receive  the  acknowledgement  from 
the  proposed  parent  then  it  does  not  join  the  tree  but  waits  for 
another  wave  message  from  another  neighbor.  Under  certain 
fault  conditions,  it  may  be  possible  for  two  nodes  to  consider 
the  same  node  as  their  child;  however  this  fault  is  detected 
during  the  PIF  phase  when  one  of  them  does  not  receive  a 
report  from  that  child.  A  formal  protocol  description  can  be 
found  in  [10]  and  proofs  of  its  correctness  in  a  related  technical 
report  [11]. 
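
The feedback phase of a PIF on an already constructed tree can be sketched as follows. This is an illustrative simplification, not the formal protocol of [10]: `run_pif` is a hypothetical name, a child timeout is modeled by the child being absent from the set `alive`, and a result of `None` stands for the failure message that triggers construction of a new tree.

```python
def run_pif(parent, alive):
    """Return the list of nodes that reported, or None if a child timed out.

    parent maps each node to its parent (the root maps to None).
    """
    children = {}
    for node, p in parent.items():
        children.setdefault(node, [])
        if p is not None:
            children.setdefault(p, []).append(node)
    root = next(n for n, p in parent.items() if p is None)

    def collect(node):
        reports = [node]
        for child in children[node]:
            if child not in alive:
                return None  # timeout: parent hears nothing from this child
            sub = collect(child)
            if sub is None:
                return None  # failure propagates toward the root
            reports.extend(sub)
        return reports  # node reports only after all children have reported

    return collect(root)
```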

When  a  tree  formation  or  PIF  phase  is  complete,  the 
protocol  is  quiescent,  so  there  is  no  ongoing  message  traffic 
unless  a  node  restarts.  In  the  absence  of  failures,  a  total 
of  three  messages  per  node  are  required  for  tree  formation: 
one  for  the  wave,  two  for  the  parent  acknowledgement.  If 
there  are  no  failures  once  a  correct  tree  has  been  constructed, 
subsequent  PIFs  will  continue  to  use  the  same  tree.  Thus, 
for  every  such  PIF,  two  messages  per  node  are  required: 
one  to  propagate  the  wave  and  one  to  return  the  feedback. 
If  failures  occur  during  the  parent  acknowledgement  process, 
additional  messages  are  required  as  a  node  attempts  to  confirm 
with  subsequent  potential  parents.  However,  this  occurs  only 
if  a  failure  happens  after  the  wave  message  but  before  the 
acknowledgement  is  complete. 
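
The message counts stated above can be turned into a quick calculation of how tree reuse lowers the average cost. The function names are hypothetical, and the figures assume the failure-free case described in the text (three messages per node for tree formation, two per node for each PIF).

```python
def tree_messages(nodes):
    """Failure-free tree formation: one wave plus two handshake messages per node."""
    return 3 * nodes


def pif_messages(nodes):
    """One PIF on an existing tree: one wave and one feedback message per node."""
    return 2 * nodes


def average_messages_per_pif(nodes, pifs_per_tree):
    """Average messages per PIF when one tree serves several PIFs."""
    total = tree_messages(nodes) + pifs_per_tree * pif_messages(nodes)
    return total / pifs_per_tree
```

For a 25 node network, a tree used for a single PIF costs 125 messages in total, whereas a tree reused for ten PIFs averages 57.5 messages per PIF.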

IV.  Chowkidar  Implementation  for  Kansei 

We  have  implemented  both  the  centralized  and  distributed 
Chowkidar  protocols  for  the  Kansei  testbed,  though  both 
implementations  can  easily  be  adapted  to  other  testbeds  with 
minor  modifications.  Our  implementations  span  the  different 
hardware  and  software  platforms  in  Kansei  listed  in  Table  I. 
In  this  section,  we  first  compare  the  performance  of  both 
protocols  based  on  data  collected  from  several  experiments 
in  Kansei.  We  then  describe  some  important  lessons  learnt 
during  the  integration  of  centralized  Chowkidar  and  some  real 
experiences  from  both  users  and  administrators  of  Kansei  in 
using  Chowkidar. 

A.  Experimental  Results 

To evaluate the performance benefits of using the distributed
protocol over the centralized one, we ran a number of experiments
using both implementations in our Kansei testbed.
We  ran  both  protocols  on  the  same  sets  of  nodes.  In  the 
initial  experiments,  we  tested  correctness  in  the  absence  of 
failures  by  simply  executing  the  protocols  on  nodes  that  were 
known  to  be  working.  We  then  injected  failures  by  killing 
Chowkidar  processes  on  randomly  selected  nodes  (the  same 
nodes  for  both  cases).  Table  II  shows  the  experimentally 
measured  performance  for  a  set  of  25  nodes.  This  data  does 
not  take  into  account  the  time  taken  to  compute  the  paths  in 
the  centralized  case  as  this  is  quite  small  on  a  powerful  server. 
Recall that the distributed protocol first constructs a spanning
tree over which subsequent PIFs can be collected; hence the
total time for the distributed protocol is the sum of the times
taken by each of these phases.


% of failed nodes    Tcent    Tdist    Ttree    TPIF
0%                   9s       7s       2s       5s
8%                   54s      14s      8.5s     5.5s
20%                  86s      16s      10.5s    5.5s
40%                  153s     17s      11.5s    5.5s

TABLE II
Performance comparison on a 25 node network.


As seen from the data, in the absence of faults, the performance
of the centralized and the distributed protocols is
quite  comparable.  However,  the  performance  of  the  centralized 
protocol  degrades  substantially  as  the  number  of  failures 
increases,  even  for  a  25  node  network.  This  is  because  the 
centralized  protocol  not  only  operates  sequentially  but  also 
tries  to  explore  all  possible  working  paths  in  case  of  a  failed 
node  before  giving  up.  By  contrast,  the  distributed  protocol 
finds  existing  paths  concurrently  instead  of  pruning  failed  ones 
sequentially, so its performance is only marginally affected. It
should be noted, though, that the centralized protocol is
inherently reliable because its sequential design avoids
interference losses, whereas the distributed implementation
had to be carefully tuned to select appropriate randomized
backoff parameters to minimize network interference created
by concurrent execution. The centralized protocol could also
be parallelized and similarly tuned for reliability, but it is clear
that it will not outperform the distributed one.

We  also  experimentally  measured  the  scalability  of  both 
protocols  by  varying  the  network  size,  the  results  of  which 
are  shown  in  Table  III. 


Total # of nodes    % of node failures    Tcent    Tdist    Ttree    TPIF
25                  0%                    9s       7s       2s       5s
50                  0%                    23s      9s       3s       5s
25                  40%                   153s     17s      11.5s    5.5s
50                  40%                   305s     23s      17s      6s

TABLE III
Scalability of centralized vs. distributed protocols.


The first two rows in the table give the completion times
for both protocols in the absence of any injected faults, while
the last two rows give the completion times when 40% of the
nodes, selected randomly in the network (the same nodes for
both protocols), have failed. The data show that as the network
size increases, the performance of the centralized protocol
degrades much faster than that of the distributed one.
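
The comparison can be made concrete by computing ratios directly from the Table III measurements. The helper names `speedups` and `growth` are our own; the numbers are the Tcent and Tdist values from the table, in seconds.

```python
# (total nodes, % failed, Tcent in s, Tdist in s), copied from Table III
ROWS = [
    (25, 0, 9.0, 7.0),
    (50, 0, 23.0, 9.0),
    (25, 40, 153.0, 17.0),
    (50, 40, 305.0, 23.0),
]


def speedups(rows):
    """Tcent / Tdist for each (nodes, % failed) configuration."""
    return {(n, pct): t_cent / t_dist for n, pct, t_cent, t_dist in rows}


def growth(rows, failure_pct):
    """Factor by which each protocol slows down going from 25 to 50 nodes."""
    by_size = {n: (tc, td) for n, pct, tc, td in rows if pct == failure_pct}
    cent = by_size[50][0] / by_size[25][0]
    dist = by_size[50][1] / by_size[25][1]
    return cent, dist
```

With no failures, doubling the network size multiplies the centralized completion time by about 2.56 but the distributed time by only about 1.29. At 40% failures, the distributed protocol is 9 times faster at 25 nodes and roughly 13 times faster at 50 nodes.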

Another important point to note in the distributed case is
that the PIF completion time increases only slightly as the
failure rate and network size are increased. This is because
the PIF completion time is a function of the depth of the
constructed spanning tree. Also, since the same tree is reused
when there are no new failures, the tree construction cost is
amortized over multiple successive PIF runs; hence, if failures
occur rarely, the average completion time for the distributed
protocol is even smaller.

Our  experiments  thus  show  that  when  carefully  tuned  for 
reliability,  the  distributed  protocol  outperforms  the  centralized 
one  and  scales  much  better  as  both  failure  rate  and  network 
size  are  increased. 


B.  Integration  with  Kansei 

As  noted,  Chowkidar  is  testbed-independent.  As  a  case 
study,  we  have  integrated  its  centralized  implementation  with 
the  Kansei  testbed,  which  satisfies  the  requirements  mentioned 
earlier. 

The Director service in Kansei is a distributed implementation
that schedules and manages experiments, automatically
terminating them when the reserved time has passed. When
that happens, each Stargate activates Chowkidar’s Stargate
components and loads Chowkidar’s mote components; since
this happens locally, it does not depend on the base station
and hence does not depend on reachability via Ethernet. At
the base station, Director updates the status of nodes in a
central database. When Chowkidar is scheduled to run, either
periodically or on demand, it accesses this database and
checks free nodes. If Director needs a node for a scheduled
experiment, it kills the Chowkidar components, rendering the
node inaccessible to Chowkidar, and updates the database. If
Chowkidar has already finished probing the nodes just removed,
there is no problem; otherwise, if the node was up, Chowkidar
will note that it is no longer accessible and will terminate with
an error and restart.

Fig. 2. Visualization of Kansei health using Chowkidar.


C.  Experience  with  Chowkidar 

Since  its  integration  with  Kansei  about  a  month  ago,  users 
and  administrators  have  been  using  Chowkidar  quite  actively 
for different reasons. In this section, we describe our initial
experiences and lessons learnt from user feedback after deploying
Chowkidar.


There  are  various  policy  issues  that  require  coordination 
between  Director  and  Chowkidar.  XSM  and  TelosB  motes 
have  several  non-interfering  radio  channels  available  in  their 
operational frequency band. Radio communication in Chowkidar
can thus occur on a reserved frequency, avoiding interference
with experiments. However, because Kansei is located in
a  warehouse  with  industrial  neighbors,  interference  prevents 
Chowkidar  from  using  a  reserved  frequency  for  802.11b  WiFi. 
To  avoid  interference,  Director  should  note  experiments  that 
use  the  WiFi  network  so  that  Chowkidar  can  avoid  using  it. 
A  similar  policy  issue  concerns  Ethernet  in  case  Chowkidar’s 
use  of  Ethernet  might  interfere  with  an  experiment. 

Since  Stargates  have  substantially  more  resources  than  XSM 
and  TelosB  motes,  it  is  reasonable  to  perform  monitoring  on 
them  even  when  they  are  in  use  by  an  experiment.  However, 
a  particular  experiment  might  prefer  that  Chowkidar  not  run 
during  that  time.  Director  and  Chowkidar  need  to  be  set  up 
so  experimenters  can  indicate  their  preference. 

Implementation  of  these  policy  issues  is  part  of  our  ongoing 
work. 


Visualizing  test  results.  At  the  end  of  a  run,  Chowkidar 
timestamps  and  stores  the  collected  results  in  a  database.  The 
results  are  also  displayed  on  a  webpage  [12]  so  that  they  are 
easily  readable  to  users  and  administrators. 

Figure  2  shows  a  screenshot  of  the  output  generated  by 
Chowkidar  which  represents  a  high-level  view  of  Kansei 
health.  In  this  visualization,  each  Stargate  and  its  attached 
motes are represented by a single logical node. Logical nodes
with all devices working correctly are denoted by simple
circles, whereas if any of these devices or their interfaces
has failed, the node is denoted by a bold, broken circle. Nodes
whose  status  could  not  be  monitored  because  no  neighbors 
were  reachable  are  denoted  by  squares.  Users  can  learn  more 
about  exactly  which  devices  and  interfaces  have  failed  by 
clicking  on  the  corresponding  circle  or  square.  A  graph  shows 
recent  history  of  the  devices.  As  seen  in  the  figure,  the 
visualizer  allows  users  to  view  the  output  of  previous  runs  by 
specifying  their  id  or  timestamp.  We  are  currently  extending 
the  visualizer  to  display  the  health  history  of  a  particular  node. 
Since experiments can leave nodes in an inaccessible state,
we plan to display cases where nodes failed immediately after
an experiment.

Using  Chowkidar  results.  In  the  past,  Kansei  users  would 
schedule  experiments  on  a  set  of  nodes  only  to  realize  later 
that  some  of  them  were  not  working.  This  usually  led  to  users 
having  to  retry  several  times  before  they  finally  ended  up  with 
a  set  of  working  nodes.  However,  users  now  check  the  latest 
Chowkidar  output  or  invoke  Chowkidar  on-demand  prior  to 
scheduling  experiments  so  their  experiments  always  run  on 
working  nodes.  By  running  Chowkidar  on-demand  after  an 
experiment,  users  are  also  able  to  verify  that  no  new  failures 
occurred  during  their  experiment,  increasing  confidence  in  the 
obtained  data. 

Chowkidar  is  also  being  used  by  Kansei  administrators  to 
diagnose  failures.  Previously,  an  administrator  would  execute 
a  script  to  ping  all  Stargates  and  then  individually  diagnose 
the  ones  that  did  not  respond.  This  approach  is  not  only 
tedious  but  also  does  not  work  for  non-IP  based  devices  such 
as  motes  that  do  not  respond  to  ping  commands.  However, 
using  Chowkidar,  Kansei  administrators  are  now  able  to  detect 
several  new  types  of  faults  such  as  mote  failure,  failure  of 
particular interfaces, etc. Even in cases where Chowkidar
cannot definitively diagnose what fault has occurred, it provides
enough information to the administrators to simplify
manual  debugging.  Administrators  are  also  using  historical 
information  in  Chowkidar  to  identify  failure-prone  devices. 
For instance, if a particular mote oscillates between correct
and failed states over many Chowkidar runs, it is likely
that its serial connector to the Stargate has become loose,
disrupting the power supply to the mote.
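
One simple way to flag such oscillating devices from stored run results is to count state changes in each device's history. This is a hypothetical sketch, not part of Chowkidar as described: the function names, the history format (one boolean per run, `True` meaning the device was reported up), and the threshold `min_transitions` are all assumptions made for illustration.

```python
def count_transitions(history):
    """Number of up/down state changes between consecutive runs."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


def flapping_devices(histories, min_transitions=3):
    """Devices whose status changed at least min_transitions times."""
    return sorted(
        device
        for device, history in histories.items()
        if count_transitions(history) >= min_transitions
    )
```

A device that failed once and stayed down has a single transition and is not flagged, while one that alternates between up and down across runs is.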

Given  that  the  implementation  of  Chowkidar  is  reliable,  we 
have  been  able  to  use  Chowkidar  for  diagnosing  failures  of  the 
Kansei  Director  service,  which  keeps  evolving  as  new  features 
are  added  and  so  is  occasionally  subject  to  programming 
bugs. For instance, if an unusually high number of nodes are
reported as failed by Chowkidar, it is likely that some Director
component has failed, causing the Chowkidar service
itself not to be loaded correctly.

Monitoring predicates. After Chowkidar went online, we
received feedback from several Kansei users and administrators
about additional information they would like to be
monitored.  For  example,  testbed  administrators  were  interested 
in  monitoring  whether  the  various  Director  processes  were 
running  correctly  on  Stargates  while  users  wanted  to  know 
whether  the  SerialForwarder  program  used  to  send/receive 
TinyOS  packets  to/from  motes  was  running  throughout  an 
experiment.  As  a  result,  we  identified  several  new  predicates 
besides  node  and  network  health  that  can  now  be  monitored 
using  Chowkidar. 

V.  Conclusions  and  Future  Work 

A health monitoring service is critical to managing the
complexity inherent in large testbeds due to network heterogeneity
and the occurrence of different types of faults.
In  this  paper  we  presented  Chowkidar,  a  reliable  service 
that  provides  a  way  to  monitor  the  status  of  heterogeneous 
testbeds automatically. By reliable, we mean that Chowkidar
only  reports  information  that  is  consistent  with  the  actual 
testbed  state,  so  this  information  is  useful  to  users  interested 
in  running  experiments  on  healthy  nodes  and  to  administrators 
who  additionally  use  diagnosis  data  provided  by  Chowkidar  to 
keep  the  testbed  fully  functional.  We  presented  two  Chowkidar 
protocols,  experimentally  compared  their  performance  and 
analyzed  scalability  issues.  We  also  described  experiences 
from  real  testbed  users  and  administrators  showing  the  value 
added  to  testbed  experimentation  and  maintenance  by  the 
integration  of  Chowkidar  with  Kansei. 

Future work will mainly focus on three aspects: extending
the functionality of Chowkidar, making Chowkidar more efficient,
and improving the usability of Chowkidar. We identify
the important tasks in each of these.

An important predicate that needs to be monitored, especially
in WSNs, is the quality of radio links. Monitoring this
predicate  requires  the  exchange  of  several  messages  before  an 
evaluation  can  be  made.  We  plan  to  integrate  a  link  estimation 
service  so  that  Chowkidar  can  report  on  link  quality  in  addition 
to  basic  “up”ness.  The  evaluated  predicates  can  also  provide 
feedback  to  Chowkidar  itself,  so  that  a  node  dynamically 
adjusts  link  costs  depending  on  the  estimation. 

Our current Chowkidar implementation does not monitor
sensors, which are an important resource in WSN
testbeds. Monitoring sensor health is difficult for several
reasons. First, ground truth is often not available, even in a
controlled testbed setting, so there is no absolute reference
point for evaluating obtained sensor readings [13]. Second,
an understanding of the physical model is critical, especially
when comparing the readings from nearby sensors. Third, an
understanding of the effect of hardware and other environmental
variations is important. We plan to use robots that are part
of the mobile Kansei platform to help monitor the health of a
sensor by generating a known signature in its neighborhood.
Similarly, we wish to monitor actuator health using calibrated
sensors.

At  present,  Chowkidar  does  not  distinguish  interface  failure 
from  misconfiguration.  In  general,  however,  misconfiguration 
can  be  detected  by  reading  the  status  of  device  registers  or 
environment  variables,  so  we  plan  to  add  configurations  to 
the  list  of  predicates  to  be  monitored  by  Chowkidar.  When  a 
device  interface  is  misconfigured  but  the  device  is  accessible 
via  some  other  interface,  this  fact  can  be  reported. 

As  network  scale  increases,  the  issue  of  bidirectional  link 
reliability  becomes  more  important.  Towards  improving  the 
efficiency of Chowkidar in unreliable wireless environments,
we will compare CSMA-based approaches combined with appropriate
backoff timing choices, which provide probabilistic guarantees
about accuracy, against deterministically reliable schemes
such as TDMA.

A node interface has two parts, a transmitter and a receiver.
Evaluating  the  receiver  locally  is  easy  but  a  neighbor  is  needed 
to  evaluate  the  transmitter.  However,  a  broadcast  might  be 
heard  by  many  neighbors  and  if  they  all  report  it,  there  will  be 
excessive  redundancy.  Also,  there  may  be  other  predicates  that 
involve  a  node’s  neighbors,  but  those  neighbors  could  be  in 
different  subtrees,  so  the  structure  of  the  spanning  tree  could 
work  against  us.  Future  research  will  focus  on  techniques  such 
as  data  compression  and  in-network  aggregation  to  improve 
efficiency  of  collecting  transmitter  health. 

Besides  functionality  and  efficiency,  we  also  plan  to  improve 
the  usability  of  Chowkidar.  Chowkidar  presently  monitors  only 
those  nodes  that  are  not  running  a  user  experiment.  Monitoring 
the  health  of  nodes  running  an  experiment  is  desirable  from  a 
user’s  perspective  to  improve  confidence  in  the  experiment 
outcome.  We  plan  to  address  the  concurrent  execution  of 
Chowkidar  with  a  user  experiment  in  two  ways.  First,  we  will 
provide  a  standard  set  of  lightweight  Chowkidar  components, 
along  with  tools  for  easy  integration  of  user  and  Chowkidar 
components.  Second,  we  will  define  mechanisms  whereby 
users  can  specify  policies  that  dictate  which  and  what  fraction 
of  available  resources  on  a  node  running  a  user  experiment  can 
be  used  by  Chowkidar  for  health  monitoring.  This  will  provide 
flexibility  to  users  in  controlling  the  interference  between  the 
experiment  and  health  monitoring.  Another  interesting  idea  for 
future  research  is  to  investigate  whether  there  is  a  systematic 
way  to  exploit  the  semantics  of  an  application  for  monitoring 
while  still  offering  correctness  guarantees. 

At  present,  the  integration  of  Chowkidar  with  Kansei  is 
one-way,  since  the  information  reported  by  Chowkidar  is  not 
used  by  Director.  Future  integration  steps  will  involve  Director 
using  the  output  of  Chowkidar  to  automatically  select  a  set  of 
nodes  that  are  known  to  be  good  to  run  an  experiment. 

We  also  plan  to  design  visualization  and  analysis  tools  that 
will  help  users  and  administrators  better  interpret  monitoring 
results, including history, produced by Chowkidar.